F-test (Variance Ratio Test)#

The two-sample F-test for variances answers a simple question:

Do these two groups have the same variability?

It is also the distribution behind the F-statistic you see in ANOVA and linear regression, but in this notebook we focus on the most direct version: testing whether two population variances are equal.


Learning goals#

  • Understand what the F-test is testing (and what it is not testing)

  • Build intuition for where the F distribution comes from (chi-square → ratio)

  • Implement the test with NumPy only (Monte Carlo p-values + critical values)

  • Learn how to interpret p-values, rejection regions, and a variance-ratio confidence interval

  • See common pitfalls (especially non-normality / outliers)

import math
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

How sample size affects the F distribution#

The degrees of freedom (df₁ = n₁−1, df₂ = n₂−1) control how much the variance estimates fluctuate.

  • Small samples (small df) → wide, heavy-tailed distribution

  • Large samples (large df) → concentrated near 1 (variance estimates stabilize)

Here are a few F pdf curves to make that shape change visible. (This cell uses the `f_pdf` helper defined in section 3; run that definition first, or move this cell after it.)

df_pairs = [(5, 5), (10, 10), (30, 30), (5, 30)]
x_grid = np.linspace(1e-4, 5.0, 900)

fig = go.Figure()
for d1, d2 in df_pairs:
    fig.add_trace(
        go.Scatter(
            x=x_grid,
            y=f_pdf(x_grid, d1, d2),
            mode='lines',
            name=f'df1={d1}, df2={d2}',
        )
    )

fig.add_vline(x=1.0, line_dash='dash', line_color='black')
fig.update_layout(
    title='How degrees of freedom shape the F distribution',
    xaxis_title='F',
    yaxis_title='density',
)
fig.show()

1) What problem does the F-test solve?#

Sometimes the mean is not the main issue. You might care more about spread:

  • Does a new manufacturing process reduce variability?

  • Is sensor A noisier than sensor B?

  • Are two groups equally consistent, even if their averages match?

A variance F-test is a hypothesis test about population variances (σ²), based on sample variances (s²).

Let’s start with a picture: two groups with the same center, but different spread.

n_a, n_b = 40, 60
a = rng.normal(loc=0.0, scale=1.0, size=n_a)
b = rng.normal(loc=0.0, scale=1.8, size=n_b)

fig = go.Figure()
fig.add_trace(go.Violin(y=a, name='Group A', box_visible=True, meanline_visible=True))
fig.add_trace(go.Violin(y=b, name='Group B', box_visible=True, meanline_visible=True))
fig.update_layout(
    title='Same center, different spread',
    yaxis_title='value',
)
fig.show()

2) The hypothesis, assumptions, and what the result means#

Hypotheses#

For two independent samples (group 1 and group 2):

  • H₀: σ₁² = σ₂² (equal population variances)

  • H₁ (two-sided): σ₁² ≠ σ₂²

  • H₁ (one-sided): σ₁² > σ₂² or σ₁² < σ₂²

Assumptions (important)#

The classic F-test for variances is derived under:

  • Independence within and across the two samples

  • Normality: each group is (approximately) normally distributed

This test is very sensitive to outliers and non-normal/heavy-tailed data. If normality is doubtful, consider more robust alternatives such as Levene, Brown–Forsythe, or Fligner–Killeen.

What does rejecting H₀ mean?#

Rejecting H₀ means: the observed difference in sample variances is unlikely if the true variances were equal (given the assumptions).

Failing to reject H₀ does not prove the variances are equal; it usually means the data do not provide enough evidence to detect a difference.

3) Where does the F distribution come from?#

For a Normal sample of size n, the sample variance is:

\[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 \]

A key result (for Normal data) is:

\[ \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1} \]

So if you take two independent samples (sizes n₁, n₂), then under H₀: σ₁² = σ₂², the ratio of sample variances

\[ F = \frac{s_1^2}{s_2^2} \]

follows an F distribution with degrees of freedom:

  • df₁ = n₁ - 1 (numerator)

  • df₂ = n₂ - 1 (denominator)

Let’s verify these building blocks with simulation.

def f_rvs(df1: int, df2: int, size: int, rng: np.random.Generator) -> np.ndarray:
    """Sample from an F(df1, df2) distribution using the chi-square ratio definition."""
    num = rng.chisquare(df1, size=size) / df1
    den = rng.chisquare(df2, size=size) / df2
    return num / den


def f_pdf(x: np.ndarray, df1: int, df2: int) -> np.ndarray:
    """Analytic F(df1, df2) pdf using only stdlib + NumPy (no SciPy)."""
    x = np.asarray(x, dtype=float)
    a = df1 / 2.0
    b = df2 / 2.0

    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_coeff = a * math.log(df1 / df2) - log_beta

    with np.errstate(divide='ignore'):
        log_pdf = (
            log_coeff
            + (a - 1.0) * np.log(x)
            - (a + b) * np.log1p((df1 / df2) * x)
        )

    pdf = np.exp(log_pdf)
    return np.where(x > 0, pdf, 0.0)


# 1) Scaled sample variance ~ chi-square (Normal data)
n = 12
sigma = 2.0
n_sim = 50_000

samples = rng.normal(loc=0.0, scale=sigma, size=(n_sim, n))
s2 = samples.var(axis=1, ddof=1)
scaled_s2 = (n - 1) * s2 / (sigma**2)

chi = rng.chisquare(df=n - 1, size=n_sim)

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=scaled_s2,
        nbinsx=90,
        histnorm='probability density',
        name='(n-1)s^2/σ^2 from samples',
        opacity=0.6,
    )
)
fig.add_trace(
    go.Histogram(
        x=chi,
        nbinsx=90,
        histnorm='probability density',
        name='Chi-square(df=n-1)',
        opacity=0.6,
    )
)
fig.update_layout(
    title='Scaled sample variance follows a chi-square distribution (Normal data)',
    barmode='overlay',
    xaxis_title='value',
    yaxis_title='density',
)
fig.show()


# 2) Ratio of (chi-square/df) terms ~ F
n1, n2 = 10, 25
df1, df2 = n1 - 1, n2 - 1

f_sim = f_rvs(df1, df2, size=n_sim, rng=rng)
x_max = np.quantile(f_sim, 0.995)
x_grid = np.linspace(1e-4, x_max, 700)

fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=f_sim,
        nbinsx=120,
        histnorm='probability density',
        name='Simulated F',
        opacity=0.6,
    )
)
fig.add_trace(go.Scatter(x=x_grid, y=f_pdf(x_grid, df1, df2), mode='lines', name='Analytic pdf'))
fig.update_layout(
    title=f'F distribution from a chi-square ratio (df1={df1}, df2={df2})',
    barmode='overlay',
    xaxis_title='F value',
    yaxis_title='density',
)
fig.show()

4) The variance F-test: statistic, p-value, and critical region#

Given samples x and y:

  • Compute unbiased sample variances: s₁², s₂² (using ddof=1)

  • Test statistic:

\[ F_{obs} = \frac{s_1^2}{s_2^2} \]
  • Degrees of freedom: df₁ = n₁ - 1, df₂ = n₂ - 1

Interpreting F_obs#

  • If F_obs ≈ 1, the sample variances are similar.

  • If F_obs is large, group 1 looks more variable than group 2.

  • If F_obs is small (≪ 1), group 1 looks less variable.

Two-sided decision rule (α = 0.05)#

Reject H₀ if F_obs lands in either tail:

\[ F_{obs} < F_{\alpha/2}(df_1, df_2) \quad \text{or} \quad F_{obs} > F_{1-\alpha/2}(df_1, df_2) \]

In practice, the p-value is:

\[ p = 2 \cdot \min(\Pr(F \le F_{obs}),\, \Pr(F \ge F_{obs})) \]

We’ll compute these probabilities with a Monte Carlo approximation (NumPy only).

def f_test_variances(
    x: np.ndarray,
    y: np.ndarray,
    *,
    alternative: str = 'two-sided',
    alpha: float = 0.05,
    n_sim: int = 300_000,
    seed: int = 0,
) -> dict:
    """Two-sample F-test for equal variances using a Monte Carlo reference distribution.

    Parameters
    ----------
    x, y:
        1D samples.
    alternative:
        'two-sided', 'greater' (σ₁² > σ₂²), or 'less' (σ₁² < σ₂²).
    alpha:
        Significance level.
    n_sim:
        Number of Monte Carlo draws from F(df1, df2).
    seed:
        RNG seed (for reproducibility).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or y.ndim != 1:
        raise ValueError('x and y must be 1D arrays.')
    if len(x) < 2 or len(y) < 2:
        raise ValueError('Need at least 2 observations per group.')
    if not (0.0 < alpha < 1.0):
        raise ValueError('alpha must be in (0, 1).')

    n1, n2 = len(x), len(y)
    df1, df2 = n1 - 1, n2 - 1

    s1_sq = x.var(ddof=1)
    s2_sq = y.var(ddof=1)
    f_obs = s1_sq / s2_sq

    rng_local = np.random.default_rng(seed)
    f_sim = f_rvs(df1, df2, size=n_sim, rng=rng_local)

    p_right = np.mean(f_sim >= f_obs)
    p_left = np.mean(f_sim <= f_obs)

    alternative_norm = alternative.strip().lower()

    if alternative_norm in {'two-sided', 'two_sided', 'two sided'}:
        alternative_canonical = 'two-sided'
        p_value = 2.0 * min(p_left, p_right)
        p_value = float(min(p_value, 1.0))

        q_low, q_high = np.quantile(f_sim, [alpha / 2.0, 1.0 - alpha / 2.0])
        critical_values = (float(q_low), float(q_high))
        reject = bool((f_obs < q_low) or (f_obs > q_high))

        # CI for variance ratio σ₁² / σ₂² (Monte Carlo quantiles)
        ci_low = float(f_obs / q_high)
        ci_high = float(f_obs / q_low)
        ci = (ci_low, ci_high)
    elif alternative_norm in {'greater', 'right', 'larger'}:
        alternative_canonical = 'greater'
        p_value = float(p_right)
        q_high = float(np.quantile(f_sim, 1.0 - alpha))
        critical_values = (q_high,)
        reject = bool(f_obs > q_high)
        ci = None
    elif alternative_norm in {'less', 'left', 'smaller'}:
        alternative_canonical = 'less'
        p_value = float(p_left)
        q_low = float(np.quantile(f_sim, alpha))
        critical_values = (q_low,)
        reject = bool(f_obs < q_low)
        ci = None
    else:
        raise ValueError("alternative must be one of: 'two-sided', 'greater', 'less'.")

    # Monte Carlo standard error for the estimated p-value (rough sense of noise)
    mc_se = float(np.sqrt(p_value * (1.0 - p_value) / n_sim))

    return {
        'n1': n1,
        'n2': n2,
        'df1': df1,
        'df2': df2,
        's1_sq': float(s1_sq),
        's2_sq': float(s2_sq),
        'f_obs': float(f_obs),
        'alternative': alternative_canonical,
        'alpha': float(alpha),
        'p_value': float(p_value),
        'mc_se': mc_se,
        'critical_values': critical_values,
        'reject_h0': reject,
        'ci_var_ratio': ci,
        'n_sim': int(n_sim),
        'seed': int(seed),
    }


def plot_f_test(result: dict, *, x_max_quantile: float = 0.997) -> go.Figure:
    df1 = int(result['df1'])
    df2 = int(result['df2'])
    f_obs = float(result['f_obs'])
    alpha = float(result['alpha'])
    alternative = result['alternative']

    rng_plot = np.random.default_rng(123)
    f_sim_plot = f_rvs(df1, df2, size=200_000, rng=rng_plot)
    x_max = float(np.quantile(f_sim_plot, x_max_quantile))
    x_max = max(x_max, f_obs) * 1.1

    x_grid = np.linspace(1e-4, x_max, 900)
    pdf = f_pdf(x_grid, df1, df2)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=x_grid, y=pdf, mode='lines', name=f'F pdf (df1={df1}, df2={df2})'))

    # Shade rejection region(s)
    if alternative == 'two-sided':
        q_low, q_high = result['critical_values']

        left = x_grid <= q_low
        right = x_grid >= q_high
        fig.add_trace(
            go.Scatter(
                x=x_grid[left],
                y=pdf[left],
                mode='lines',
                fill='tozeroy',
                name=f'Reject (α/2={alpha/2:.3f})',
                line=dict(color='rgba(239, 85, 59, 1.0)'),
                fillcolor='rgba(239, 85, 59, 0.25)',
            )
        )
        fig.add_trace(
            go.Scatter(
                x=x_grid[right],
                y=pdf[right],
                mode='lines',
                fill='tozeroy',
                name=f'Reject (α/2={alpha/2:.3f})',
                line=dict(color='rgba(239, 85, 59, 1.0)'),
                fillcolor='rgba(239, 85, 59, 0.25)',
                showlegend=False,
            )
        )
        fig.add_vline(x=q_low, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
        fig.add_vline(x=q_high, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
    elif alternative == 'greater':
        (q_high,) = result['critical_values']
        right = x_grid >= q_high
        fig.add_trace(
            go.Scatter(
                x=x_grid[right],
                y=pdf[right],
                mode='lines',
                fill='tozeroy',
                name=f'Reject (α={alpha:.3f})',
                line=dict(color='rgba(239, 85, 59, 1.0)'),
                fillcolor='rgba(239, 85, 59, 0.25)',
            )
        )
        fig.add_vline(x=q_high, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
    elif alternative == 'less':
        (q_low,) = result['critical_values']
        left = x_grid <= q_low
        fig.add_trace(
            go.Scatter(
                x=x_grid[left],
                y=pdf[left],
                mode='lines',
                fill='tozeroy',
                name=f'Reject (α={alpha:.3f})',
                line=dict(color='rgba(239, 85, 59, 1.0)'),
                fillcolor='rgba(239, 85, 59, 0.25)',
            )
        )
        fig.add_vline(x=q_low, line_dash='dot', line_color='rgba(239, 85, 59, 0.9)')
    else:
        raise ValueError('Unknown alternative in result dict.')

    fig.add_vline(x=f_obs, line_dash='dash', line_color='black')
    fig.add_annotation(
        x=f_obs,
        y=float(np.interp(f_obs, x_grid, pdf)),
        text=f"F_obs={f_obs:.3f}<br>p≈{result['p_value']:.4f}",
        showarrow=True,
        arrowhead=2,
        ax=40,
        ay=-40,
        bgcolor='rgba(255,255,255,0.9)',
    )

    fig.update_layout(
        title='F-test: reference distribution and rejection region',
        xaxis_title='F',
        yaxis_title='density',
        legend_title='Region',
    )
    return fig

5) Worked example: equal variances (H₀ true)#

We will sample two Normal groups with the same standard deviation (σ₁ = σ₂). In this situation, we expect the test to reject only about α of the time.

For a single run, the p-value can land anywhere from 0 to 1 (randomness!), but on average it should not look systematically small.

n1, n2 = 18, 22
x = rng.normal(loc=0.0, scale=1.2, size=n1)
y = rng.normal(loc=0.0, scale=1.2, size=n2)

res_equal = f_test_variances(x, y, alternative='two-sided', alpha=0.05, n_sim=400_000, seed=1)
res_equal
{'n1': 18,
 'n2': 22,
 'df1': 17,
 'df2': 21,
 's1_sq': 1.781384820492569,
 's2_sq': 1.6656962640243989,
 'f_obs': 1.0694535726391445,
 'alternative': 'two-sided',
 'alpha': 0.05,
 'p_value': 0.87165,
 'mc_se': 0.0005288579145195805,
 'critical_values': (0.3850427399035428, 2.4863470025297367),
 'reject_h0': False,
 'ci_var_ratio': (0.43013045707257586, 2.777493150259251),
 'n_sim': 400000,
 'seed': 1}
plot_f_test(res_equal).show()
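As a sanity check on the claim that the test rejects only about α of the time under H₀, here is a small self-contained simulation (the sample sizes, seed, and number of replicates are illustrative choices, and the Monte Carlo p-value is computed directly from a sorted reference sample rather than via `f_test_variances`):

```python
import numpy as np

rng = np.random.default_rng(7)
n1, n2, alpha = 18, 22, 0.05
n_rep, n_ref = 4_000, 200_000

# Reference draws from F(df1, df2) via the chi-square ratio definition
ref = (rng.chisquare(n1 - 1, n_ref) / (n1 - 1)) / (rng.chisquare(n2 - 1, n_ref) / (n2 - 1))
ref.sort()

# Many H0-true replicates: equal-variance Normal groups
f_obs = (rng.normal(size=(n_rep, n1)).var(axis=1, ddof=1)
         / rng.normal(size=(n_rep, n2)).var(axis=1, ddof=1))

# Two-sided Monte Carlo p-values from the sorted reference sample
p_left = np.searchsorted(ref, f_obs, side='right') / n_ref
p_values = np.minimum(2.0 * np.minimum(p_left, 1.0 - p_left), 1.0)

print(f"rejection rate at alpha={alpha}: {np.mean(p_values < alpha):.3f}")  # close to 0.05
```

Up to Monte Carlo noise, the rejection rate should hover near α, and the p-values should look roughly uniform on (0, 1).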

6) Worked example: different variances (H₁ true)#

Now we generate two groups where the second group is noisier (larger σ). This should push F_obs away from 1 and often produce a small p-value.

n1, n2 = 18, 22
x = rng.normal(loc=0.0, scale=1.0, size=n1)
y = rng.normal(loc=0.0, scale=2.0, size=n2)

res_diff = f_test_variances(x, y, alternative='two-sided', alpha=0.05, n_sim=400_000, seed=2)
res_diff
{'n1': 18,
 'n2': 22,
 'df1': 17,
 'df2': 21,
 's1_sq': 0.7968991166284403,
 's2_sq': 4.228029458951296,
 'f_obs': 0.18848002937664016,
 'alternative': 'two-sided',
 'alpha': 0.05,
 'p_value': 0.00098,
 'mc_se': 4.947321497537834e-05,
 'critical_values': (0.38428465507232556, 2.4922479853766086),
 'reject_h0': True,
 'ci_var_ratio': (0.07562651489039465, 0.490469829822288),
 'n_sim': 400000,
 'seed': 2}
plot_f_test(res_diff).show()

7) Interpreting the output (what the numbers mean)#

From the result dictionary:

  • f_obs = s1_sq / s2_sq is the observed variance ratio.

  • p_value is the probability (under H₀) of seeing an F statistic at least as extreme as the observed one.

  • reject_h0 tells you whether the observed statistic lands in the rejection region for your chosen α.

  • ci_var_ratio is a (Monte Carlo) confidence interval for σ₁² / σ₂².

A practical reading#

  • If p_value < α: the data are inconsistent with equal variances (under the Normality assumption).

  • If p_value ≥ α: you do not have enough evidence to claim the variances differ.

A helpful companion to the p-value is the confidence interval:

  • If the CI for σ₁² / σ₂² includes 1, equal variances are plausible.

  • If the CI lies entirely above 1, σ₁² is likely larger than σ₂².

  • If the CI lies entirely below 1, σ₁² is likely smaller than σ₂².
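The `ci_var_ratio` interval comes from inverting the F pivot: under the Normality assumption, (s₁²/s₂²)/(σ₁²/σ₂²) ~ F(df₁, df₂), so a 95% CI is (F_obs/q_high, F_obs/q_low). Here is a minimal standalone sketch of that construction (the data, seed, and scale 1.5 are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=18)   # hypothetical group 1
y = rng.normal(0.0, 1.5, size=22)   # hypothetical group 2 (noisier)

df1, df2 = len(x) - 1, len(y) - 1
f_obs = x.var(ddof=1) / y.var(ddof=1)

# Monte Carlo F(df1, df2) quantiles via the chi-square ratio definition
ref = (rng.chisquare(df1, 400_000) / df1) / (rng.chisquare(df2, 400_000) / df2)
q_low, q_high = np.quantile(ref, [0.025, 0.975])

# Invert the pivot (s1^2/s2^2) / (sigma1^2/sigma2^2) ~ F(df1, df2)
ci = (f_obs / q_high, f_obs / q_low)
print(f"95% CI for σ1²/σ2²: ({ci[0]:.3f}, {ci[1]:.3f}); contains 1: {ci[0] < 1 < ci[1]}")
```

Note the inversion flips the quantiles: the upper F quantile produces the lower CI endpoint.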

8) Pitfall demo: heavy tails can break the test#

Even when the true variances are equal, non-normal data can cause the F-test to reject too often.

Below we compare the empirical false positive rate (Type I error) under:

  • Normal data (assumptions match)

  • Heavy-tailed data (Student-t with df=3, scaled to variance 1)

alpha = 0.05
n1, n2 = 20, 20
df1, df2 = n1 - 1, n2 - 1

# Critical region under the F(df1, df2) model (computed once)
rng_crit = np.random.default_rng(0)
f_crit_sim = f_rvs(df1, df2, size=600_000, rng=rng_crit)
q_low, q_high = np.quantile(f_crit_sim, [alpha / 2.0, 1.0 - alpha / 2.0])

n_rep = 8_000

# Case A: Normal (variance 1)
x_norm = rng.normal(0.0, 1.0, size=(n_rep, n1))
y_norm = rng.normal(0.0, 1.0, size=(n_rep, n2))
f_norm = x_norm.var(axis=1, ddof=1) / y_norm.var(axis=1, ddof=1)
reject_norm = np.mean((f_norm < q_low) | (f_norm > q_high))

# Case B: Heavy-tailed t(df=3), scaled to variance 1
# Var(t_df) = df/(df-2) for df>2 -> 3/(1)=3, so divide by sqrt(3) to get variance ~1
x_t = rng.standard_t(df=3, size=(n_rep, n1)) / np.sqrt(3.0)
y_t = rng.standard_t(df=3, size=(n_rep, n2)) / np.sqrt(3.0)
f_t = x_t.var(axis=1, ddof=1) / y_t.var(axis=1, ddof=1)
reject_t = np.mean((f_t < q_low) | (f_t > q_high))

fig = px.bar(
    x=['Normal', 'Student-t df=3 (heavy-tailed)'],
    y=[reject_norm, reject_t],
    title='Type I error inflation when Normality is violated',
    labels={'x': 'data generating distribution (true variances equal)', 'y': f'fraction rejected (α={alpha})'},
)
fig.add_hline(y=alpha, line_dash='dash', line_color='black')
fig.update_layout(yaxis_range=[0, max(alpha * 2.5, reject_t * 1.15)])
fig.show()

reject_norm, reject_t
(0.047375, 0.292375)

Takeaway#

If your data are plausibly non-normal or contain outliers, the classic F-test can be misleading.

In that case, look at robust variance tests (Levene / Brown–Forsythe / Fligner–Killeen) and always pair any test with plots (box/violin, histograms, QQ-plots).

9) Connection to ANOVA / regression (why F shows up everywhere)#

In ANOVA and linear regression, an F-statistic typically compares two variance estimates:

  • a variance explained by a model/effect (signal)

  • a residual variance not explained (noise)

That ratio also follows an F distribution under suitable assumptions. So even though this notebook focuses on the variance-ratio test, the same distribution is doing a lot of work across classical statistics.
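To make the connection concrete, here is a minimal NumPy-only sketch of the one-way ANOVA F-statistic for three hypothetical groups (the group means, sizes, and seed are illustrative): the between-group mean square plays the role of s₁², the within-group mean square the role of s₂².

```python
import numpy as np

rng = np.random.default_rng(5)
groups = [rng.normal(0.0, 1.0, 30), rng.normal(0.5, 1.0, 30), rng.normal(1.0, 1.0, 30)]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Between-group ("signal") and within-group ("noise") sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = (explained variance estimate) / (residual variance estimate)
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
print(f"F({k - 1}, {n_total - k}) = {f_stat:.3f}")
```

Under H₀ (all group means equal) this statistic follows F(k−1, n−k), which is exactly the reference distribution we built in section 3.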

Exercises#

  1. Swap the groups (x and y) in the worked examples. What happens to f_obs? What happens to a two-sided p-value?

  2. Increase sample sizes and observe how the F distribution tightens around 1 under H₀.

  3. Create data with a few extreme outliers and see how quickly the test starts rejecting.
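For exercise 1, a quick self-contained check of the reciprocal relationship (sample sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.normal(size=18)
y = rng.normal(scale=2.0, size=22)

# Swapping the groups inverts the statistic (and swaps df1/df2);
# a two-sided p-value is unchanged up to Monte Carlo noise.
f_xy = x.var(ddof=1) / y.var(ddof=1)
f_yx = y.var(ddof=1) / x.var(ddof=1)
print(f"f_xy = {f_xy:.4f}, f_yx = {f_yx:.4f}, product = {f_xy * f_yx:.4f}")
```

One-sided p-values do change under the swap: 'greater' for (x, y) corresponds to 'less' for (y, x).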

References#

  • F distribution (definition as ratio of chi-square variables)

  • Variance ratio confidence interval derived from F quantiles

  • Robust alternatives: Levene, Brown–Forsythe, Fligner–Killeen